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Abstract 

Given the joint chances of a pair of random variables one can compute quan- 
tities of interest, like the mutual information. The Bayesian treatment of 
unknown chances involves computing, from a second order prior distribution 
and the data likelihood, a posterior distribution of the chances. A common 
treatment of incomplete data is to assume ignorability and determine the 
chances by the expectation maximization (EM) algorithm. The two differ- 
ent methods above are well established but typically separated. This paper 
joins the two approaches in the case of Dirichlet priors, and derives efficient 
approximations for the mean, mode and the (co)variance of the chances and 
the mutual information. Furthermore, we prove the unimodality of the poste- 
rior distribution, whence the important property of convergence of EM to the 
global maximum in the chosen framework. These results are applied to the 
problem of selecting features for incremental learning and naive Bayes classifi- 
cation. A fast filter based on the distribution of mutual information is shown 
to outperform the traditional filter based on empirical mutual information on 
a number of incomplete real data sets. 
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1 Introduction 



Let 71"^ be the joint chances of a pair of random variables Many statistical 

quantities can be computes if 7r is known; for instance the mutual information I(tz) 
used for measuring the stochastic dependency of i and j. The usual procedure in the 
common case of unknown chances it^ is to use the empirical probabilities itij=nij/n 
as if they were precisely known chances. This is not always suitable: (a) The point 
estimate 7Ty does not carry information about the reliability of the estimate, (b) 
Samples may be incomplete in the sense that in some samples the variable i or 
j may not be observed. 

The Bayesian solution to (a) is to use a (second order) prior distribution p(iz) 
over the chances tt themselves, which takes account of uncertainty about 7r. From 
the prior p(n) and the likelihood p(D\ir) of data D one can compute the posterior 
p{iz\D). The traditional solution to (b) is to assume that the data are missing at 
idom [LR87J. A (local) maximum likelihood estimate for 7r can then be obtained 



rant 



by the expectation-maximization (EM) algorithm |CF74j . 



In this work we present a full Bayesian treatment of incomplete discrete data 
with Dirichlet prior p(ir) and apply the results to feature selection. This work is 
a natural continuation of ZH02J, which focused on the case of complete data and, 
by working out a special case, provided encouraging evidence for the extension of 
the proposed approach to incomplete data. Here we develop that framework by 
creating a very general method for incomplete discrete data, providing the complete 
mathematical derivations, as well as experiments on incomplete real data sets. In 
particular, Section |21 derives expressions (in leading order in 1/n) for p(ir\D). In 
the important case (for feature selection) of missingness in one component of (t,j) 
only, we give closed form expressions for the mode, mean and covariance of 7r. In 
the general missingness case we get a self-consistency equation which coincides with 
the EM algorithm, that is known to converge to a local maximum. We show that 
p{iz\D) is actually unimodal, which implies that in fact EM always converges to the 
global maximum. We use the results to derive in Section El closed- form leading order 
expressions of the distribution of mutual information p(I\D). In case of complete 
data, the mean and variance of I have been approximated numerically in Kle99j and 
analytically in |Hut02j . The results are then applied to feature selection in Section 
HI A popular filter approach discards features of low empirical mutual information 
I (it) |Lew921 IBT971 ICHH+02| . We compare this filter to the two filters (introduced 
in |ZH02| for complete data and tested empirically in this case) that use credible 
intervals based on p(I\D) to robustly estimate mutual information. The filters are 
empirically tested in Section |3] by coupling them with the naive Bayes classifier 
[DHS01 to incrementally learn from and classify incomplete data. On five real 
data sets that we used, one of the two proposed filters consistently outperforms the 
traditional filter. 



2 



2 Posterior Distribution for Incomplete Data 



Missing data. Consider two discrete random variables, class i and feature 1 j 
taking values in {l,...,r} and {l,...,s}, respectively, and an i.i.d. random process with 
samples G {l,...,r} x {l,...,s} drawn with joint probability 7Ty. In practice one 
often has to deal with incomplete information. For instance, observed instances often 
consist of several features plus class label, but some features may not be observed, i.e. 
if % is a class label and j is a feature, from the pair only i is observed. We extend 
the contingency table to include rij?, which counts the number of instances in 
which only the class i is observed (= number of (i,?) instances). Similarly, n?j counts 
the number of (?j) instances, where the class label is missing. We make the common 
assumption that the missing-data mechanism is ignorable (missing at random and 
distinct) |LR87j . i.e. the probability distribution of class labels i of instances with 
missing feature j is assumed to coincide with the marginal Similarly, 
given an instance with missing class label, the probability of the feature being j is 
assumed to be 7r +J - :=^i 7r «r 

Maximum likelihood estimate of n. The likelihood of a specific data set D 
of size N = n+n + 7+n? + with contingency table N = {riij ,ni? ,n?j} given w, hence, is 
p(D\n,n,n + 7 } n u ) = U ij ^ij j UiK+Uj 7r +j- Assuming a uniform p(n) ~ 1-5(tt ++ -1), 
Bayes' rule leads to the posterior 2 

p(n\D) =p(n\N) = — ^ II <? II <j ^++ " ^ C 1 ) 

ij i j 

where the normalization M is chosen such that Jp(Tz\N)d7T = 1. With missing 
features and classes there is no exact closed form expression for Af '. 

In the following, we restrict ourselves to a discussion of leading-order (in N^ 1 ) 
expressions, which are as accurate as one can specify one's prior knowledge [H ut02j . 
In leading order, the mean E[k] coincides with the mode of p(tt\N) (=the maximum 
likelihood estimate) of 7r. The log-likelihood function logp(7r|AT) is 

L(n\N) = ^ n ij lo S Kyi + Hil log n *+ + S n ^ log n+ * ~ lo gA/"(iV) - A(tt ++ - 1), 

ij i j 

where we have replaced the 5 function by a Lagrange multiplier A to take into 
account the restriction ic,, = 1. The maximum is at = — + — + — - — X = 0. 

OTTij TTij TTi+ TT+j 

1 The mathematical development is independent of the interpretation as class and feature, but 
it is convenient to use this terminology already here. 

2 Most (but not all) non- informative priors forp(7r) also lead to a Dirichlet posterior distribution 
(fT|) with interpretation riy = n 1 ^ +n"j — 1, where are the number of samples and n"j 

comprises prior information (1 for the uniform prior, | for Jeffreys' prior, for Haldane's prior, — 
for Perks' prior, and other numbers in case of specific prior knowledge [GCSR95] ). Furthermore, 
in leading order in l/N, any Dirichlet prior with n"j =0(1) leads to the same results, hence we 
can simply assume a uniform prior. The reason for the 5(n ++ — 1) is that tt must be constrained 
to the probability simplex ir ++ :=X^-t7 7r y = ^" 
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Multiplying this by 7Tjj and summing over i and j we obtain X = N. The maximum 
likelihood estimate 7r is, hence, given by 

1 / 7Ti 7 - TTij \ , s 

This is a non-linear equation in 7Ty, which, in general, has no closed form solution. 
Nevertheless (J2J) can be used to approximate tz^. Eq. (j2J) coincides with the popular 
expectation-maximization (EM) algorithm |GF74| if one inserts a first estimate 7r?- = 
2s. into the r.h.s. of (J2J) and then uses the resulting l.h.s. tz}, as a new estimate, etc. 



Unimodality of p(ir\N). The rsxrs Hessian matrix H e]R rs ' rs of — L and the 
second derivative in direction of the rs dimensional column vector v G M rs are 



o i n n x x , n ^ x i n? ix 

u n ij u n kl n ij n i+ n +j 



v Hv = ^ Vij H (ij){kl) v kl = ^—v^ + ^-^v^ + ^-^v^ > 0. 

ijkl ij l 3 i l + j +3 

This shows that — L is a convex function of 7r, hence p(ir\N) has a single (possibly 
degenerate) global maximum. L is strictly convex if njj>0 for all ij, since v T Hv>0 
\/vj^0 in this case 3 . This implies a unique global maximum, which is attained in 
the interior of the probability simplex. Since EM is known to converge to a local 
maximum, this shows, that in fact EM always converges to the global maximum. 



Covariance of 7r. With 



A (ij)(kl) '■= H {ij)(kl)[n] = N 



5ik5ji 5ik 5ji 

— H h — 

Pij Pn p?j _ 



A 2 ~ 2 ~ 2 

P,r- V^. p J? :=iV^±, pjj-^N^. (3) 

rijj n i? n?j 

and A:=7r — 7r we can represent the posterior to leading order as an rs — 1 dimen- 
sional Gaussian: 

p(7r\N)~e-^ TAA 6(A ++ ). (4) 

The easiest way to compute the covariance (and other quantities) is to also rep- 
resent the ^-function as a narrow Gaussian of width e ~ 0. Inserting 5(A ++ ) « 
^^exp(— 2^-A T ee T A) into (j3J), where e^- = l for all (hence e T A = A ++ ), leads 

to a full r,s- dimensional Gaussian with kernel A = A + uv T , w = u = ie. The co- 
variance of a Gaussian with kernel A is A . Using the Sherman- Morrison formula 



3 Note that ni?>0 Vi is not sufficient, since Vi + =0 for v^O is possible. Actually w ++ = 0. 
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A 1 = A' 1 - A' 1 -^Jl-^ A- 1 (PFTV921 p73] and e->0 we get 

A- 1 ee T A- 1 



Cov( i3 -)( W )[7r] := E[AijA 



ki 



^\A \ 



A" 1 



(5) 



J (ij)(ki) 



where ~ denotes = up to terms of order N~ 2 . Singular A are easily avoided by 
choosing a prior such that riy>0 for all ij. A may be inverted exactly or iteratively, 
the latter by a trivial inversion of the diagonal part StkSji/pij and by treating 5i k /pi?+ 
5ji/p?j as a perturbation. 



Missing features only, no missing classes. In the case of missing features 
only (no missing classes), i.e. for n?j — 0, closed form expressions for Cov["7r] can 



be obtained. If we sum (J2J) over j we get 7r i+ = with N i+ :=n i+ +nj?. Inserting 
Tfi+ = ^- into the r.h.s. of (0) and solving w.r.t. tt^ we get the explicit expression 



7T 



N i+ n, 



ki 



N m 



(6) 



Furthermore, it can easily be verified (by multiplication) that A^j-j^i) — N[5 ik 5ji/ pij + 



fiik/pn] has inverse [A 1 



(ij)(kl) N 



jj[pijbikbji 



Pi++Pi! 



§ik\- With the abbreviations 



Pn 



pn + pi+ 



and Q := ^^Pi+Qi? 



(7) 



we get [A 1 e] ij =Yjki[A 1 }(ij)(ki) = jfPijQi?&V-de T A 1 e = Q/N. Inserting everything 
into (0) we get 



Cov ( i i)(w) [7r] ~ — 



PijPkl x PijQi?PklQk? 

— ; — °ik ~ 



Pi++Pi 



Q 



Expressions for the general case. The contribution from unlabeled classes can 
be interpreted as a rank s modification of A in the case of no missing classes. One can 
use Wood bury's formula [B+UDV T ] 1 ^B- 1 -B- 1 U[D~ 1 +V T B- 1 U]- 1 V T B 1 
jPKTV921 p75] with B {ij){k i ) = 5 ik 5 j i/ 'pij+S ik / 'p i? , D i 5ji/p?j, and U ^ i:j) i = V ^ = 5^ 
to reduce the inversion of the rs x rs matrix A to the inversion of only a single s- 
dimensional matrix. The result (which may be inserted into (jSJ)) can be written in 
the form 

Fijl&ik ^ Fjjm \G \mnFkln j (9) 



[a- 1 : 



(ij)(ki) 



N 



Pi?+pi+ 
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3 Distribution of Mutual Information 



Mutual information /. An important measure of the stochastic dependence of 
i and j is the mutual information 



/ ( 7r ) = 52 52 lo S — = 52 n ij lo S *H - 52 ^+ lo S *i+ - 52 7r +J lo S 7r +i • 

i=l j=l J ij i j 

The point estimate for / is I (it). In the Bayesian approach one takes the posterior 
(JTJ from which the posterior probability density of the mutual information can, in 
principle, be computed: 4 

p(I\N) = J 5(I(ir) - I)p(n\N)d rs ir. (10) 

5 The S(-) distribution restricts the integral to 7r for which I(n) = I. For large sample 
size, N—>oo, p(ir\N) is strongly peaked around the mode 7r = 7r and p(I\N) gets 
strongly peaked around the frequency estimate I— I (it). The (central) moments of 
/ are of special interest. The mean 

/CO p 
I-p(I\N)dI = J I(ir)p(ir\N)d rs ir = I(ir) + 0(N- 1 ) (11) 

coincides in leading order with the point estimate, where it has been computed 
in Section El Together with the variance Var[J] = E[(I-E[I}) 2 } = E[I 2 ]-E[I} 2 
(computed below) we can approximate (fTU|) by a Gaussian 6 

P(W~exp(-^)~exp(-itf4E) (12) 

In a previous work we derived higher order central moments (skewness and kurtosis) 
and higher order (in iV -1 ) approximations in the case of complete data |Hut 02j. 



Variance of /. The leading order variance of the mutual information I(ir) has 
been related 7 in [Hut02 to the covariance of it: 

Var[I] ~ Vlog ; ^ log ; Cov fa - )(fc ;)[7r] (13) 

W 7T l+ 7T +i 7C k+ n +l 

4 I(tt) denotes the mutual information for the specific chances 7T, whereas / in the context 
above is just some non-negative real number. / will also denote the mutual information random 
variable in the expectation E[I] and variance Var[i]. Expectations are always w.r.t. to the posterior 
distribution p(ir\N). 

5 Since 0<I(ir)<I max with sharp upper bound I m ax =min{logr,logs}, the domain of p(I\n) is 
[OJmax], and integrals over / may be restricted to J '" ax . 

6 For I(it) t^O the central limit theorem ensures convergence of p(I\N) to a Gaussian. Using a 
Beta distribution instead of (|12|l . which also converges to a Gaussian, has slight advantages over 
(P2J [ZHT)2] . 

7 it was defined in |Hut02j as the mean E[w] whereas 7r has been defined in this work as the 
ML estimate. Furthermore the Dirichlet priors differ. Since to leading order both definitions of 7r 
coincide, the prior does not matter, and the expression is also valid for incomplete data case, the 
use of in this work is permitted. 
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Inserting (JHJ) for the covariance into (|13|) we get for the variance of the mutual 
information in leading order in 1 /N in the case of missing features only, the following 
expression: 

Var[J] *hR- J 2 /Q - P], K:=J2 P» ( lo S ' > ( 14 ) 



pi? ^ Ki+K+j 

A closed form expression for Af(N) also exists. Symmetric expressions for missing 
classes only (no missing features) can be obtained. Note that for the complete case 
n?j = rii? = 0, we have 7Ty = p^- — Pi? = oo, Qi? = 1, J = J, K = K, and P = 0, 
consistent with Hut02] (where J and K are defined and the accuracy is discussed). 



There is at least one reason for minutely having inserted all expressions into 
each other and introducing quite a number definitions. In the so presented form all 
expressions involve at most a double sum. Hence, the overall computation time of 
the mean and variance is 0{rs) in the case of missing features only. 

Expression for the general case. The result for the covariance (0) can be in- 
serted into (|13|) to obtain the variance of the mutual information to leading order. 



Var[J] ~ l T A~H - (l T A- 1 e) 2 /(e T A- 1 e) where Uj = log 



7Ty 



TC i+ TC +j 



Inserting and rearranging terms appropriately we can compute Var[J] in time 
0{rs) plus the time 0(s 2 r) to compute the sxs matrix G and time 0(s 3 ) to invert 
it, plus the time O(^rs) for determining it^, where # is the number of iterations of 
EM. Of course, one can and should always choose s<r. Note that these expressions 
converge for N — > oo to the exact values. The fraction of data with missing feature 
or class needs not to be small. 

In the following we apply the obtained results to feature selection for incomplete 
data. Since we only used labeled data we could use (|TT|) with (0) , and with (J7J) 
and ©• 



4 Feature Selection 

Feature selection is a basic step in the process of building classifiers [BL97 . We con- 
sider the well-known filter (F) that computes the empirical mutual information I (it) 
between features and the class, and discards features with I (it) <e for some thresh- 
old e |Lew92j . This is an easy and effective approach that has gained popularity 
with time. 

We compare F to the two filters introduced in |ZH02| for the case of complete 
data, and extended here to the more general case. The backward filter (BF) discards 
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a feature if its value of mutual information with the class is less than or equal to e 
with high probability p (discard if p(I <s\N) >p). The forward filter (FF) includes 
a feature if the mutual information is greater than e with high probability p (include 
if p(I > e\N) >p). BF is a conservative filter, because it will only discard features 
after observing substantial evidence supporting their irrelevance. FF instead will 
tend to use fewer features, i.e. only those for which there is substantial evidence 
about them being useful in predicting the class. 

For the subsequent classification task we use the naive Bayes classifier [D H73j . 
which is often a good classification model. Despite its simplifying assumptions (see 
[DP97J), it often competes successfully with much more complex classifiers, such 
as C4.5 | Qui93 . Our experiments focus on the incremental use of the naive Bayes 



classifier, a natural learning process when the data are available sequentially: the 
data set is read instance by instance; each time, the chosen filter selects a subset 
of features that the naive Bayes uses to classify the new instance; the naive Bayes 
then updates its knowledge by taking into consideration the new instance and its 
actual class. Note that for increasing sizes of the learning set the filters converge to 
the same behavior, since the variance of / tends to zero (see [ZH02J for details). 

For each filter, we are interested in experimentally evaluating two quantities: for 
each instance of the data set, the average number of correct predictions (namely, 
the prediction accuracy) of the naive Bayes classifier up to such instance; and the 
average number of features used. By these quantities we can compare the filters and 
judge their effectiveness. 

The implementation details for the following experiments include: using the 
Gaussian approximation ()12j) to the distribution of mutual information with the 
mean (JTTJ) using (jHJ) , and the variance using (jJJ) and (jHJ) ; using natural logarithms 
everywhere; and setting the level p of the posterior probability to 0.95, and the 
threshold e to 0.003 as discussed in |ZH02j. 



5 Experimental Analysis 

Table Q lists five data sets together with the experimental results. These are real 
data sets on a number of different domains. The data sets presenting non-nominal 
features have been pre-discretized by MLC++ |KJL + 94| . default options (i.e., the 
common entropy based discretization). This step may remove some features judging 
them as irrelevant, so the number of features in the table refers to the data sets after 
the possible discretization. The instances have been randomly sorted before starting 
the experiments. 

The last three columns of Table^show that FF selects lower (i.e. better) number 
of features than the commonly used filter F, which in turn, selects lower number 
of features than the filter BF. We used the two-tails paired t test at level 0.05 to 
compare the prediction accuracies of the naive Bayes with different filters, in the 
first k instances of the data set, for each k. On four data sets out of five, both the 
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Table 1: Incomplete data sets used for the experiments, together with their number of 
features, instances, missing values, and the relative frequency of the majority class. The 
data sets are available from the UCI repository of machine learning data sets jMA 95l. 
Average number of features selected by the filters on the entire data set are reported in 
the last three columns. FF always selected fewer features than F; F almost always selected 
fewer features than BF. Prediction accuracies where significantly different only for the 
Hypothyroidloss data set. 



Name 


#feat. 


#inst. 


#m.v. 


maj. class 


FF 


F 


BF 


Audiology 


69 


226 


317 


0.212 


64.3 


68.0 


68.7 


Crx 


15 


690 


67 


0.555 


9.7 


12.6 


13.8 


Horse-colic 


18 


368 


1281 


0.630 


11.8 


16.1 


17.4 


Hypothyroidloss 


23 


3163 


1980 


0.952 


4.3 


8.3 


13.2 


Soybean-large 


35 


683 


2337 


0.135 


34.2 


35 


35 



differences between FF and F, and the differences between F and BF, were never 
statistically significant, despite the different number of used features, as indicated 
in Table [T] The reduction can be very pronounced, as for the Hypothyroidloss data 
set. This is also the only data set for which the prediction accuracies of F and 
FF are significantly different, in favor of the latter. This is displayed in Figure 
Similar (even stronger) results have been found for 10 complete data sets analyzed 
in |ZH02] . 

The most prominent evidence from the experiments is the better performance 
of FF versus the traditional filter F. In the following we look at FF from another 
perspective to exemplify and explain its behavior. FF includes a feature if p(I > 
e\n) >p, according to its definition. Let us assume that FF is realized by means 
of the Gaussian (as in the experiments above), and let us choose p~ 0.977. The 
condition p(I > e\n) > p becomes or, in an approximate way, 

given that /(tt) is the first-order approximation of E[I\ (cf. 
flllj) ) . We can reg ard£+2yVar[J] as a new threshold e'. Under this interpretation, 
we see that FF is approximately equal to using the filter F with the bigger threshold 
e' . This interpretation makes it also clearer why FF can be better suited than F for 
sequential learning tasks. In sequential learning, Var[J] decreases as new units are 
read; this makes e' to be a self-adapting threshold that adjusts the level of caution 
(in including features) as more units are read. In the limit, e' is equal to e. This 
characteristic of self-adaptation, which is absent in F, seems to be decisive to the 
success of FF. 

6 Conclusions 

We addressed the problem of the reliability of empirical estimates for the chances w 
and the mutual information / in the case of incomplete discrete data. We used the 
Bayesian framework to derive reliable and quickly computable approximations for 
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ocoa>c\iLocoT-'*r^o 

t — t — i — CM CM C\J CO 

Instance number 

Figure 1: Prediction accuracies of the naive Bayes with filters F and FF on the Hypothy- 
roidloss data set. BF is not reported because there is no significant difference with the 
F curve. The differences between F and FF are significant in the range of observations 
71-374 (white area). The maximum difference is achieved at observation 71, where the 
accuracies are 0.986 (FF) vs. 0.930 (F). 

the mean, mode and the (co)variance of 7r and J(7r) under the posterior distribu- 
tion p(tz\D). We showed that p(7r\D) is unimodal, which implies that EM always 
converges to the global maximum. The results allowed us to efficiently determine 
credible intervals for / with incomplete data. Applications are manifold, e.g. to 
robustly infer classification trees or Bayesian networks. As far as feature selection 
is concerned, we empirically showed that the forward filter, which includes a feature 
if the mutual information is greater than e with high probability, outperforms the 
popular filter based on empirical mutual information in sequential learning tasks. 
This result for incomplete data is obtained jointly with the naive Bayes classifier. 
More broadly speaking, obtaining the distribution of mutual information when data 
are incomplete may form a basis on which reliable and effective uncertain models 
can be developed. 
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